Business Scenario: A diamond merchant has come to you for help. They want to create an automated system to predict the apt price of a diamond based on its shape/size/color etc.

They have shared the shape/size/color data of 53940 diamonds.

Your task is to create a machine learning model which can predict the price of a diamond based on its characteristics.

In the case study below, I discuss the step-by-step approach to creating a Machine Learning predictive model for such scenarios.

You can use this flow as a template to solve any supervised ML Regression problem!

The flow of the case study is as below:

  • Reading the data in Python
  • Defining the problem statement
  • Identifying the Target variable
  • Looking at the distribution of Target variable
  • Basic Data exploration
  • Visual Exploratory Data Analysis for data distribution (Histogram and Barcharts)
  • Feature Selection based on data distribution
  • Outlier treatment
  • Missing Values treatment
  • Visual correlation analysis
  • Statistical correlation analysis (Feature Selection)
  • Converting data to numeric for ML
  • Sampling and K-fold cross validation
  • Trying multiple Regression algorithms
  • Selecting the best Model
  • Deploying the best model in production

I know it's a long list! Take a deep breath... and let us get started!

Reading the data into Python

This is one of the most important steps in machine learning! You must understand the data and the domain well before trying to apply any machine learning algorithm.

The data has one file "DiamondpricesData.csv". This file contains 53940 diamond details.

Data description

The business meaning of each column in the data is as below

  • price: The price of the Diamond
  • carat: The carat value of the Diamond
  • cut: The cut type of the Diamond, it determines the shine
  • color: The color value of the Diamond
  • clarity: The clarity grade of the Diamond
  • depth: The depth value of the Diamond
  • table: Flat facet on its surface — the large, flat surface facet that you can see when you look at the diamond from above.
  • x: Width of the diamond
  • y: Length of the diamond
  • z: Height of the diamond
In [1]:
# Suppressing the warning messages
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Reading the dataset
import pandas as pd
import numpy as np
DiamondpricesData=pd.read_csv('/Users/farukh/Python Case Studies/DiamondpricesData.csv', encoding='latin')
print('Shape before deleting duplicate values:', DiamondpricesData.shape)

# Removing duplicate rows if any
DiamondpricesData=DiamondpricesData.drop_duplicates()
print('Shape After deleting duplicate values:', DiamondpricesData.shape)

# Printing sample data
# Start observing the Quantitative/Categorical/Qualitative variables
DiamondpricesData.head(10)
Shape before deleting duplicate values: (53940, 10)
Shape After deleting duplicate values: (53794, 10)
Out[2]:
price carat cut color clarity depth table x y z
0 326 0.23 Ideal E SI2 61.5 55.0 3.95 3.98 2.43
1 326 0.21 Premium E SI1 59.8 61.0 3.89 3.84 2.31
2 327 0.23 Good E VS1 56.9 65.0 4.05 4.07 2.31
3 334 0.29 Premium I VS2 62.4 58.0 4.20 4.23 2.63
4 335 0.31 Good J SI2 63.3 58.0 4.34 4.35 2.75
5 336 0.24 Very Good J VVS2 NaN 57.0 3.94 3.96 2.48
6 336 0.24 Very Good I VVS1 62.3 57.0 3.95 3.98 2.47
7 337 0.26 Very Good H SI1 61.9 55.0 4.07 4.11 2.53
8 337 0.22 Fair E VS2 65.1 61.0 3.87 3.78 2.49
9 338 0.23 Very Good H VS1 59.4 61.0 4.00 4.05 2.39

Defining the problem statement:

Create a ML model which can predict the price of a diamond

  • Target Variable: price
  • Predictors: color, cut, carat etc.

Determining the type of Machine Learning

Based on the problem statement you can understand that we need to create a supervised ML Regression model, as the target variable is Continuous.

Looking at the distribution of Target variable

  • If the target variable's distribution is too skewed, predictive modeling becomes difficult because very few rows cover the extreme values.
  • A bell curve is desirable, but a slight positive or negative skew is also fine.
  • When performing Regression, make sure the histogram looks like a bell curve or a slightly skewed version of it. Otherwise, it affects the Machine Learning algorithm's ability to learn all the scenarios.
In [3]:
%matplotlib inline
# Creating a histogram as the Target variable is Continuous
DiamondpricesData['price'].hist()
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x121144e10>

The data distribution of the target variable is satisfactory to proceed further. There is a sufficient number of rows for each range of values to learn from. It is slightly positively skewed, which is acceptable.
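
The visual check can be backed up with a number. The snippet below is a minimal sketch, assuming a synthetic right-skewed series as a stand-in for a price column (it does not use the diamond data). It measures skewness with pandas and shows how a log transform reduces it, should the skew ever be too strong.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a price column (assumption, not the real data)
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000), name="price")

raw_skew = price.skew()            # sample skewness of the raw values
log_skew = np.log1p(price).skew()  # sample skewness after a log transform

print("Skewness of raw values:", round(raw_skew, 2))
print("Skewness after log1p  :", round(log_skew, 2))
```

A skewness close to 0 corresponds to a roughly symmetric, bell-like histogram, while larger positive values indicate a longer right tail.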

Basic Data Exploration

This step is performed to gauge the overall data: the volume of data and the types of columns present in it. An initial assessment of the data should be done to identify which columns are Quantitative, Categorical or Qualitative.

This step helps to start the column rejection process. You must look at each column carefully and ask, does this column affect the values of the Target variable? For example in this case study, you will ask, does this column affect the price of the diamond? If the answer is a clear "No", then remove the column immediately from the data, otherwise keep the column for further analysis.

There are four commands which are used for Basic data exploration in Python

  • head() : This helps to see a few sample rows of the data
  • info() : This provides the summarized information of the data
  • describe() : This provides the descriptive statistical details of the data
  • nunique(): This helps us to identify if a column is categorical or continuous
In [4]:
# Looking at sample rows in the data
DiamondpricesData.head()
Out[4]:
price carat cut color clarity depth table x y z
0 326 0.23 Ideal E SI2 61.5 55.0 3.95 3.98 2.43
1 326 0.21 Premium E SI1 59.8 61.0 3.89 3.84 2.31
2 327 0.23 Good E VS1 56.9 65.0 4.05 4.07 2.31
3 334 0.29 Premium I VS2 62.4 58.0 4.20 4.23 2.63
4 335 0.31 Good J SI2 63.3 58.0 4.34 4.35 2.75
In [5]:
# Observing the summarized information of data
# Data types, Missing values based on number of non-null values Vs total rows etc.
# Remove those variables from data which have too many missing values (Missing Values > 30%)
# Remove Qualitative variables which cannot be used in Machine Learning
DiamondpricesData.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53794 entries, 0 to 53939
Data columns (total 10 columns):
price      53794 non-null int64
carat      53794 non-null float64
cut        53794 non-null object
color      53788 non-null object
clarity    53794 non-null object
depth      53780 non-null float64
table      53794 non-null float64
x          53794 non-null float64
y          53794 non-null float64
z          53794 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.5+ MB
In [6]:
# Looking at the descriptive statistics of the data
DiamondpricesData.describe(include='all')
Out[6]:
price carat cut color clarity depth table x y z
count 53794.000000 53794.00000 53794 53788 53794 53780.000000 53794.000000 53794.000000 53794.000000 53794.000000
unique NaN NaN 5 7 8 NaN NaN NaN NaN NaN
top NaN NaN Ideal G SI1 NaN NaN NaN NaN NaN
freq NaN NaN 21488 11256 13032 NaN NaN NaN NaN NaN
mean 3933.065082 0.79778 NaN NaN NaN 61.748154 57.458109 5.731214 5.734653 3.538714
std 3988.114460 0.47339 NaN NaN NaN 1.429948 2.233679 1.120695 1.141209 0.705037
min 326.000000 0.20000 NaN NaN NaN 43.000000 43.000000 0.000000 0.000000 0.000000
25% 951.000000 0.40000 NaN NaN NaN 61.000000 56.000000 4.710000 4.720000 2.910000
50% 2401.000000 0.70000 NaN NaN NaN 61.800000 57.000000 5.700000 5.710000 3.530000
75% 5326.750000 1.04000 NaN NaN NaN 62.500000 59.000000 6.540000 6.540000 4.030000
max 18823.000000 5.01000 NaN NaN NaN 79.000000 95.000000 10.740000 58.900000 31.800000
In [7]:
# Finding unique values for each column
# To understand which column is categorical and which one is Continuous
# Typically if the number of unique values is < 20 then the variable is likely to be a category, otherwise continuous
DiamondpricesData.nunique()
Out[7]:
price      11602
carat        273
cut            5
color          7
clarity        8
depth        184
table        127
x            554
y            552
z            375
dtype: int64
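
The rule of thumb from the comment above can be turned into a small helper. This is only a sketch: the function name `classify_columns`, the `max_categories` parameter and the toy data are my own assumptions for illustration, and the threshold of 20 is a heuristic, not a hard rule.

```python
import pandas as pd

def classify_columns(df, max_categories=20):
    """Label each column 'categorical' or 'continuous' by its unique-value count."""
    counts = df.nunique()
    return {col: ("categorical" if n < max_categories else "continuous")
            for col, n in counts.items()}

# Toy data (hypothetical): 3 distinct cut values, 30 distinct carat values
toy = pd.DataFrame({
    "cut": ["Ideal", "Good", "Fair"] * 10,
    "carat": [0.2 + 0.01 * i for i in range(30)],
})
labels = classify_columns(toy)
print(labels)
```

The result is only a first guess. A numeric column with few distinct values still deserves a manual look before it is treated as a category.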

Basic Data Exploration Results

Based on the basic exploration above, you can now create a simple report of the data, noting down your observations regarding each column. This creates an initial roadmap for further analysis.

The selected columns in this step are not final; further study will be done and then a final list will be created.

  • price: Continuous. Selected. This is the Target Variable!
  • carat: Continuous. Selected.
  • cut: Categorical. Selected.
  • color: Categorical. Selected.
  • clarity: Categorical. Selected.
  • depth: Continuous. Selected.
  • table: Continuous. Selected.
  • x: Continuous. Selected.
  • y: Continuous. Selected.
  • z: Continuous. Selected.

Visual Exploratory Data Analysis

  • Categorical variables: Bar plot
  • Continuous variables: Histogram

Visualize distribution of all the Categorical Predictor variables in the data using bar plots

We can spot a categorical variable in the data by looking at the unique values in them. Typically a categorical variable contains less than 20 Unique values AND there is repetition of values, which means the data can be grouped by those unique values.

Based on the Basic Data Exploration above, we have spotted three categorical predictors in the data

Categorical Predictors: 'cut', 'color', 'clarity'

We use bar charts to see how the data is distributed for these categorical columns.

In [8]:
# Plotting multiple bar charts at once for categorical variables
# Since there is no default function which can plot bar charts for multiple columns at once
# we are defining our own function for the same

def PlotBarCharts(inpData, colsToPlot):
    %matplotlib inline
    
    import matplotlib.pyplot as plt
    
    # Generating multiple subplots
    fig, subPlot=plt.subplots(nrows=1, ncols=len(colsToPlot), figsize=(20,5))
    fig.suptitle('Bar charts of: '+ str(colsToPlot))

    for colName, plotNumber in zip(colsToPlot, range(len(colsToPlot))):
        inpData.groupby(colName).size().plot(kind='bar',ax=subPlot[plotNumber])
In [9]:
#####################################################################
# Calling the function
PlotBarCharts(inpData=DiamondpricesData, colsToPlot=['cut', 'color', 'clarity'])

Bar Charts Interpretation

These bar charts represent the frequencies of each category in the Y-axis and the category names in the X-axis.

In the ideal bar chart each category has comparable frequency. Hence, there are enough rows for each category in the data for the ML algorithm to learn.

If a column shows a heavily skewed distribution, where a single bar dominates and the other categories are present in very low numbers (or there is only one value at all), it may not be very helpful in machine learning. There is little information to learn from, so the algorithm cannot find a rule such as "when the value is this, the target variable is that". We confirm this in the correlation analysis section and take a final call to select or reject the column.
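
Whether one bar dominates can also be checked numerically. The following is a minimal sketch on toy data: the helper name `dominant_share` is my own, and what counts as "too dominant" is left to judgment.

```python
import pandas as pd

def dominant_share(series):
    """Return the fraction of rows taken by the most frequent category."""
    # value_counts sorts by frequency in descending order, so the first entry is the largest
    return series.value_counts(normalize=True).iloc[0]

balanced = pd.Series(["A", "B", "C", "D"] * 25)     # every category has 25 rows
skewed = pd.Series(["A"] * 97 + ["B", "C", "D"])    # one category has 97 of 100 rows

print("balanced:", dominant_share(balanced))
print("skewed  :", dominant_share(skewed))
```

A share close to 1 is the numeric counterpart of a bar chart with a single dominating bar.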

Selected Categorical Variables: In this data, all three categorical variables are selected for further analysis.

'cut', 'color', 'clarity'


Visualize distribution of all the Continuous Predictor variables in the data using histograms

Based on the Basic Data Exploration, there are six continuous predictor variables: 'carat', 'depth', 'table', 'x', 'y' and 'z'.

In [10]:
# Plotting histograms of multiple columns together
DiamondpricesData.hist(['carat', 'depth', 'table', 'x','y','z'], figsize=(18,10))
Out[10]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x123156ed0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1230fcbd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1230b93d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12306dbd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x12158b410>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x121517c10>]],
      dtype=object)

Histogram Interpretation

Histograms show us the data distribution for a single continuous variable.

The X-axis shows the range of values and the Y-axis represents the number of values in that range. For example, in the above histogram of "carat", there are around 25000 rows in the data that have a value between 0 and 0.5.

The ideal outcome for histogram is a bell curve or slightly skewed bell curve. If there is too much skewness, then outlier treatment should be done and the column should be re-examined, if that also does not solve the problem then only reject the column.

Selected Continuous Variables:

  • carat : Selected. The distribution is good.
  • table: Selected. The distribution is good.
  • depth: Selected. The distribution is good.
  • x: Selected. Outliers seen near 0, need to treat them.
  • y: Selected. Outliers seen beyond 20, need to treat them.
  • z: Selected. Outliers seen beyond 10, need to treat them.

Outlier treatment

Outliers are extreme values in the data which are far away from most of the values. You can see them as the tails in the histogram.

Outliers must be treated one column at a time, as the treatment will be slightly different for each column.

Why should I treat the outliers?

Outliers bias the training of machine learning models. As the algorithm tries to fit the extreme values, it moves away from the majority of the data.

There are two options to treat outliers in the data.

  • Option-1: Delete the outlier Records. Only if there are just few rows lost.
  • Option-2: Impute the outlier values with a logical business value

Below we are finding out the most logical value to be replaced in place of outliers by looking at the histogram.

Replacing outliers for 'x'

In [11]:
# Finding nearest values to 2 mark
DiamondpricesData['x'][DiamondpricesData['x']>2].sort_values(ascending=True)
Out[11]:
31596     3.73
31600     3.73
31598     3.74
31599     3.76
31601     3.77
         ...  
26444    10.01
25999    10.02
25998    10.14
27630    10.23
27415    10.74
Name: x, Length: 53787, dtype: float64

The above result shows that the nearest logical value is 3.73, hence replacing any values less than that with it.

In [12]:
# Replacing outliers with nearest possible value
# Using .loc avoids chained assignment, which may silently fail to update the DataFrame
DiamondpricesData.loc[DiamondpricesData['x']<3.73, 'x'] = 3.73

Replacing outliers for 'y'

In [13]:
# Finding nearest values to 20 mark
DiamondpricesData['y'][DiamondpricesData['y']<20].sort_values(ascending=False)
Out[13]:
27415    10.54
27630    10.16
25998    10.10
25999     9.94
26444     9.94
         ...  
27429     0.00
15951     0.00
24520     0.00
49556     0.00
11963     0.00
Name: y, Length: 53792, dtype: float64

Above result shows the nearest logical value is 10.54, hence, replacing any value above 20 with it.

In [14]:
# Replacing outliers with nearest possible value
# Using .loc avoids chained assignment, which may silently fail to update the DataFrame
DiamondpricesData.loc[DiamondpricesData['y']>20, 'y'] = 10.54

Replacing outliers for 'z'

In [15]:
# Finding nearest values to 10 mark
DiamondpricesData['z'][DiamondpricesData['z']<10].sort_values(ascending=False)
Out[15]:
24067    8.06
27415    6.98
27630    6.72
27130    6.43
23644    6.38
         ... 
27429    0.00
27112    0.00
15951    0.00
4791     0.00
27503    0.00
Name: z, Length: 53793, dtype: float64

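The above result shows that 8.06 is the largest value below 10, but it is still isolated from the rest of the data. The next value, 6.98, is taken as the nearest logical value, hence replacing any value above 8 with it.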

In [16]:
# Replacing outliers with nearest possible value
# Using .loc avoids chained assignment, which may silently fail to update the DataFrame
DiamondpricesData.loc[DiamondpricesData['z']>8, 'z'] = 6.98

Visualizing distribution after outlier treatment

The distribution has improved after the outlier treatment. There is still a tail but it is thick, that means there are many values in that range, hence, it is acceptable.

In [17]:
# Plotting the histogram
DiamondpricesData.hist(['x','y', 'z'], figsize=(18,8))
Out[17]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x122841450>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1228f46d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x12291ee50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1228b4690>]],
      dtype=object)

Outliers seen on the left hand side near zero! Need to treat them as well.

Replacing outliers for 'y' on the left side

In [18]:
# Finding nearest value beyond 2
DiamondpricesData['y'][DiamondpricesData['y']>2].sort_values(ascending=True)
Out[18]:
31600     3.68
31596     3.71
31598     3.71
31601     3.72
31599     3.73
         ...  
25998    10.10
27630    10.16
27415    10.54
49189    10.54
24067    10.54
Name: y, Length: 53788, dtype: float64

Above result shows the nearest logical value is 3.68, hence, replacing any value below 2 with it.

In [19]:
# Replacing outliers with nearest possible value
# Using .loc avoids chained assignment, which may silently fail to update the DataFrame
DiamondpricesData.loc[DiamondpricesData['y']<2, 'y'] = 3.68

Replacing outliers for 'z' on the left side

In [20]:
# Finding nearest value beyond 2
DiamondpricesData['z'][DiamondpricesData['z']>2].sort_values(ascending=True)
Out[20]:
39246    2.06
31592    2.24
47138    2.25
31591    2.26
14       2.27
         ... 
27130    6.43
27630    6.72
24067    6.98
27415    6.98
48410    6.98
Name: z, Length: 53772, dtype: float64

Above result shows the nearest logical value is 2.06, hence, replacing any value below 2 with it.

In [21]:
# Replacing outliers with nearest possible value
# Using .loc avoids chained assignment, which may silently fail to update the DataFrame
DiamondpricesData.loc[DiamondpricesData['z']<2, 'z'] = 2.06

Visualizing distribution again after outlier treatment

The distribution has improved after the outlier treatment. The outliers on the left side are treated.

In [22]:
# Plotting the histogram
DiamondpricesData.hist(['x','y', 'z'], figsize=(18,8))
Out[22]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x12360e450>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x123635890>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x123669b90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1236d1350>]],
      dtype=object)
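
The manual steps in this section (find the nearest acceptable value, then overwrite everything beyond a cut-off with it) can be condensed into one helper. This is only a sketch on a toy series: the function `replace_with_nearest` and its `lower`/`upper` parameters are hypothetical names, and the cut-offs still have to be chosen by looking at the histogram.

```python
import pandas as pd

def replace_with_nearest(series, lower=None, upper=None):
    """Replace values outside [lower, upper] with the nearest observed in-range value."""
    s = series.copy()
    in_range = s
    if lower is not None:
        in_range = in_range[in_range >= lower]
    if upper is not None:
        in_range = in_range[in_range <= upper]
    if lower is not None:
        s.loc[s < lower] = in_range.min()   # smallest acceptable value
    if upper is not None:
        s.loc[s > upper] = in_range.max()   # largest acceptable value
    return s

# Toy series (hypothetical) with one outlier on each side
toy = pd.Series([0.0, 3.73, 3.74, 5.0, 10.54, 58.9])
cleaned = replace_with_nearest(toy, lower=2, upper=20)
print(cleaned.tolist())
```

The helper works on a copy, so the original series stays untouched until the result is assigned back to the column.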

Missing values treatment

Missing values are treated for each column separately.

If a column has more than 30% data missing, then missing value treatment cannot be done. That column must be rejected because too much information is missing.

The options for treating missing values in data are listed below.

  • Delete the missing value rows if there are only few records
  • Impute the missing values with MEDIAN value for continuous variables
  • Impute the missing values with MODE value for categorical variables
  • Interpolate the values based on nearby values
  • Interpolate the values based on business logic
In [23]:
# Finding how many missing values are there for each column
DiamondpricesData.isnull().sum()
Out[23]:
price       0
carat       0
cut         0
color       6
clarity     0
depth      14
table       0
x           0
y           0
z           0
dtype: int64

I am choosing to replace missing values using Median and Mode values

In [24]:
# Replacing missing value for categorical data using MODE value
# Assigning the result back avoids calling fillna(inplace=True) on a column selection
DiamondpricesData['color']=DiamondpricesData['color'].fillna(DiamondpricesData['color'].mode()[0])
In [25]:
# Replacing missing value for continuous data using median value
# Assigning the result back avoids calling fillna(inplace=True) on a column selection
DiamondpricesData['depth']=DiamondpricesData['depth'].fillna(DiamondpricesData['depth'].median())
In [26]:
# Checking missing values again after treatment
DiamondpricesData.isnull().sum()
Out[26]:
price      0
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
dtype: int64
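
When many columns have gaps, the median/mode rule can be applied in a loop instead of column by column. The sketch below assumes a toy DataFrame, and the helper name `impute_median_mode` is my own. It treats every numeric column as continuous and every other column as categorical, which is a simplification.

```python
import numpy as np
import pandas as pd

def impute_median_mode(df):
    """Fill numeric columns with their median and other columns with their mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy data (hypothetical) with one missing value per column
toy = pd.DataFrame({
    "depth": [61.5, np.nan, 62.3, 59.8],
    "color": ["E", "E", None, "J"],
})
filled = impute_median_mode(toy)
print(filled)
```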

Feature Selection

Now it's time to finally choose the best columns (Features) which are correlated to the Target variable. This can be done directly by measuring the correlation values or ANOVA/Chi-Square tests. However, it is always helpful to visualize the relation between the Target variable and each of the predictors to get a better sense of data.

I have listed below the techniques used for visualizing relationship between two variables as well as measuring the strength statistically.

Visual exploration of relationship between variables

  • Continuous Vs Continuous ---- Scatter Plot
  • Categorical Vs Continuous---- Box Plot
  • Categorical Vs Categorical---- Grouped Bar Plots

Statistical measurement of relationship strength between variables

  • Continuous Vs Continuous ---- Correlation matrix
  • Categorical Vs Continuous---- ANOVA test
  • Categorical Vs Categorical--- Chi-Square test

In this case study the Target variable is Continuous, hence below two scenarios will be present

  • Continuous Target Variable Vs Continuous Predictor
  • Continuous Target Variable Vs Categorical Predictor

Relationship exploration: Continuous Vs Continuous -- Scatter Charts

When the Target variable is continuous and the predictor is also continuous, we can visualize the relationship between the two variables using a scatter plot and measure the strength of the relation using Pearson's correlation value.

In [27]:
ContinuousCols=['carat', 'depth', 'table', 'x','y','z']

# Plotting scatter chart for each predictor vs the target variable
for predictor in ContinuousCols:
    DiamondpricesData.plot.scatter(x=predictor, y='price', figsize=(10,5), title=predictor+" VS "+ 'price')

Scatter charts interpretation

What should you look for in these scatter charts?

Trend. You should try to see if there is a visible trend or not. There could be three scenarios

  1. Increasing Trend: This means both variables are positively correlated. In simpler terms, they are directly proportional to each other, if one value increases, other also increases. This is good for ML!

  2. Decreasing Trend: This means both variables are negatively correlated. In simpler terms, they are inversely proportional to each other, if one value increases, other decreases. This is also good for ML!

  3. No Trend: You cannot see any clear increasing or decreasing trend. This means there is no correlation between the variables. Hence the predictor cannot be used for ML.

Based on this chart you can get a good idea about the predictor, if it will be useful or not. You confirm this by looking at the correlation value.

Outliers seen in the scatter charts

In the above scatter charts you can observe some of the points in x, y and z columns, which are in the lower range. There is a straight line on the left side.

These points will interfere with the model fit.

Hence they must be removed from the data.

In [28]:
# Creating a data filter to remove outliers from data
DataFilter=(DiamondpricesData['z']>2.06) & (DiamondpricesData['z']<6.5)
DiamondpricesData=DiamondpricesData[DataFilter]

Looking at the scatter charts again after outlier removal

In [29]:
ContinuousCols=['carat', 'depth', 'table', 'x','y','z']

# Plotting scatter chart for each predictor vs the target variable
for predictor in ContinuousCols:
    DiamondpricesData.plot.scatter(x=predictor, y='price', figsize=(10,5), title=predictor+" VS "+ 'price')

Statistical Feature Selection (Continuous Vs Continuous) using Correlation value

Pearson's correlation coefficient can simply be calculated as the covariance between two features $x$ and $y$ (numerator) divided by the product of their standard deviations (denominator):

$$r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}$$

  • This value can be calculated only between two numeric columns
  • Correlation between [-1,0) means inversely proportional; the scatter plot will show a downward trend
  • Correlation between (0,1] means directly proportional; the scatter plot will show an upward trend
  • Correlation near {0} means no linear relationship; the scatter plot will show no clear trend.
  • If the correlation value between two variables is > 0.5 in magnitude, it indicates a good relationship; the sign does not matter
  • We observe the correlations between Target variable and all other predictor variables(s) to check which columns/features/predictors are actually related to the target variable in question
In [30]:
# Calculating correlation matrix
ContinuousCols=['price','carat', 'depth', 'table', 'x','y','z']

# Creating the correlation matrix
CorrelationData=DiamondpricesData[ContinuousCols].corr()
CorrelationData
Out[30]:
price carat depth table x y z
price 1.000000 0.921845 -0.011424 0.126696 0.887063 0.888460 0.882354
carat 0.921845 1.000000 0.027212 0.181277 0.978136 0.977011 0.977179
depth -0.011424 0.027212 1.000000 -0.297608 -0.025531 -0.028635 0.095916
table 0.126696 0.181277 -0.297608 1.000000 0.195519 0.189201 0.154926
x 0.887063 0.978136 -0.025531 0.195519 1.000000 0.998435 0.991574
y 0.888460 0.977011 -0.028635 0.189201 0.998435 1.000000 0.991271
z 0.882354 0.977179 0.095916 0.154926 0.991574 0.991271 1.000000
In [31]:
# Filtering only those columns whose absolute correlation with the Target Variable exceeds a threshold
# 0.5 is the usual cut-off; a relaxed threshold of 0.2 is used here
CorrelationData['price'][abs(CorrelationData['price']) > 0.2 ]
Out[31]:
price    1.000000
carat    0.921845
x        0.887063
y        0.888460
z        0.882354
Name: price, dtype: float64

Final selected Continuous columns:

'carat', 'x','y','z'
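
As a check on the definition of Pearson's correlation given earlier, the value can be computed by hand from the covariance and the standard deviations and compared with what pandas returns. This is a minimal sketch on synthetic data. The two series are assumptions and have nothing to do with the diamond columns.

```python
import numpy as np
import pandas as pd

# Synthetic, strongly related pair of variables (assumption for illustration)
rng = np.random.default_rng(1)
x = pd.Series(rng.normal(size=200))
y = 2.0 * x + pd.Series(rng.normal(scale=0.5, size=200))

# Covariance divided by the product of the standard deviations
r_manual = x.cov(y) / (x.std() * y.std())

# pandas uses Pearson's method by default
r_pandas = x.corr(y)

print("manual:", r_manual)
print("pandas:", r_pandas)
```

Both numbers agree up to floating-point error, which is why `corr()` can be used directly on the whole DataFrame as done above.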


Relationship exploration: Categorical Vs Continuous -- Box Plots

When the target variable is Continuous and the predictor variable is Categorical, we analyze the relation using box plots and measure the strength of the relation using the ANOVA test.

In [32]:
# Box plots for continuous Target Variable "price" and categorical predictors
CategoricalColsList=['cut', 'color', 'clarity']

import matplotlib.pyplot as plt
fig, PlotCanvas=plt.subplots(nrows=1, ncols=len(CategoricalColsList), figsize=(18,5))

# Creating box plots for each categorical predictor against the Target Variable "price"
for PredictorCol , i in zip(CategoricalColsList, range(len(CategoricalColsList))):
    DiamondpricesData.boxplot(column='price', by=PredictorCol, figsize=(5,5), vert=True, ax=PlotCanvas[i])

Box-Plots interpretation

What should you look for in these box plots?

These plots give an idea about the data distribution of the continuous target variable (price) on the Y-axis for each category on the X-axis.

If the distribution looks similar for each category (boxes are in the same line), that means the categorical variable has NO effect on the target variable. Hence, the variables are not correlated with each other.

On the other hand, if the distribution is different for each category (the boxes are not in the same line!), it hints that these variables might be correlated with price.

In this data, all three categorical predictors look correlated with the Target variable.

We confirm this by looking at the results of ANOVA test below


Statistical Feature Selection (Categorical Vs Continuous) using ANOVA test

Analysis of variance(ANOVA) is performed to check if there is any relationship between the given continuous and categorical variable

  • Assumption(H0): There is NO relation between the given variables (i.e. the average (mean) value of the numeric Target variable is the same for all the groups in the categorical Predictor variable)
  • ANOVA Test result: The p-value, i.e. the probability of seeing group differences at least this large if H0 were true
In [33]:
# Defining a function to find the statistical relationship with all the categorical variables
def FunctionAnova(inpData, TargetVariable, CategoricalPredictorList):
    from scipy.stats import f_oneway

    # Creating an empty list of final selected predictors
    SelectedPredictors=[]
    
    print('##### ANOVA Results ##### \n')
    for predictor in CategoricalPredictorList:
        CategoryGroupLists=inpData.groupby(predictor)[TargetVariable].apply(list)
        AnovaResults = f_oneway(*CategoryGroupLists)
        
        # If the ANOVA P-Value is <0.05, that means we reject H0
        if (AnovaResults[1] < 0.05):
            print(predictor, 'is correlated with', TargetVariable, '| P-Value:', AnovaResults[1])
            SelectedPredictors.append(predictor)
        else:
            print(predictor, 'is NOT correlated with', TargetVariable, '| P-Value:', AnovaResults[1])
    
    return(SelectedPredictors)
In [34]:
# Calling the function to check which categorical variables are correlated with target
CategoricalPredictorList=['cut', 'color', 'clarity']
FunctionAnova(inpData=DiamondpricesData, 
              TargetVariable='price', 
              CategoricalPredictorList=CategoricalPredictorList)
##### ANOVA Results ##### 

cut is correlated with price | P-Value: 2.41639003272117e-146
color is correlated with price | P-Value: 0.0
clarity is correlated with price | P-Value: 1.655076560414e-312
Out[34]:
['cut', 'color', 'clarity']

The results of ANOVA confirm our visual analysis using box plots above.

All categorical variables are correlated with the Target variable. This is something we guessed by looking at the box plots!

Final selected Categorical columns:

'cut', 'color', 'clarity'
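
To get a feel for how the ANOVA p-value behaves, it helps to run the test on data where the answer is known in advance. The sketch below uses synthetic groups (an assumption, not the diamond data): one set where all groups share the same mean and one where the means clearly differ.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)

# Three groups drawn with the same mean: H0 is true by construction
same = [rng.normal(loc=100, scale=10, size=200) for _ in range(3)]

# Three groups drawn with clearly different means: H0 is false by construction
different = [rng.normal(loc=m, scale=10, size=200) for m in (100, 120, 140)]

p_same = f_oneway(*same).pvalue
p_diff = f_oneway(*different).pvalue

print("P-value, same means     :", p_same)
print("P-value, different means:", p_diff)
```

When the means differ, the p-value falls far below 0.05 and H0 is rejected. When they are equal, the p-value can land anywhere between 0 and 1 and is usually above 0.05.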


Selecting final predictors for Machine Learning

Based on the above tests, selecting the final columns for machine learning

In [35]:
SelectedColumns=['carat', 'x','y','z','cut', 'color', 'clarity']

# Selecting final columns
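# .copy() makes DataForML independent of DiamondpricesData, so later edits do not raise SettingWithCopyWarning
DataForML=DiamondpricesData[SelectedColumns].copy()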
DataForML.head()
Out[35]:
carat x y z cut color clarity
0 0.23 3.95 3.98 2.43 Ideal E SI2
1 0.21 3.89 3.84 2.31 Premium E SI1
2 0.23 4.05 4.07 2.31 Good E VS1
3 0.29 4.20 4.23 2.63 Premium I VS2
4 0.31 4.34 4.35 2.75 Good J SI2
In [36]:
# Saving this final data for reference during deployment
DataForML.to_pickle('DataForML.pkl')

Data Pre-processing for Machine Learning

List of steps performed on predictor variables before data can be used for machine learning

  1. Converting each Ordinal Categorical columns to numeric
  2. Converting Binary nominal Categorical columns to numeric using 1/0 mapping
  3. Converting all other nominal categorical columns to numeric using pd.get_dummies()
  4. Data Transformation (Optional): Standardization/Normalization/log/sqrt. Important if you are using distance based algorithms like KNN, or Neural Networks
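
The first three steps can be sketched on a toy table as follows. All column names and category values here are hypothetical and chosen only to show one ordinal, one binary nominal and one multi-valued nominal column.

```python
import pandas as pd

toy = pd.DataFrame({
    "size": ["small", "large", "medium"],   # ordinal (hypothetical column)
    "insured": ["yes", "no", "yes"],        # binary nominal (hypothetical column)
    "shape": ["round", "oval", "pear"],     # nominal (hypothetical column)
})

# Step 1: ordinal column mapped to numbers that respect the order
toy["size"] = toy["size"].map({"small": 1, "medium": 2, "large": 3})

# Step 2: binary nominal column mapped to 1/0
toy["insured"] = toy["insured"].map({"no": 0, "yes": 1})

# Step 3: remaining nominal column converted to dummy variables
numeric = pd.get_dummies(toy, columns=["shape"])
print(numeric.columns.tolist())
```

Only the nominal column is expanded into several dummy columns. The ordinal and binary columns stay as single numeric columns.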

Converting the ordinal variable to numeric using mapping

In [37]:
# Looking at unique values of ordinal column
DataForML['cut'].unique()
Out[37]:
array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
In [38]:
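# Replacing the ordinal values
# Note: the usual quality order for cut is Fair < Good < Very Good < Premium < Ideal.
# The codes below (which the outputs in this case study are based on) do not follow that order;
# adjust the mapping if a strictly ordinal encoding is needed.
DataForML['cut']=DataForML['cut'].map({'Good':1, 
                          'Very Good':2,
                          'Fair':3,
                          'Ideal':4,
                          'Premium':5
                         })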

image.png

In [39]:
# Looking at unique values of ordinal column
DataForML['color'].unique()
Out[39]:
array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)
In [40]:
# Replacing the ordinal values (J is the lowest color grade, D the highest)
DataForML['color']=DataForML['color'].map({'J':1, 
                          'I':2,
                          'H':3,
                          'G':4,
                          'F':5,
                          'E':6,
                          'D':7
                         })

image.png

In [41]:
# Looking at unique values of ordinal column
DataForML['clarity'].unique()
Out[41]:
array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'],
      dtype=object)
In [42]:
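# Replacing the ordinal values
# Note: the usual clarity order is I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF.
# The codes below (which the outputs in this case study are based on) swap the order within each pair;
# adjust the mapping if a strictly ordinal encoding is needed.
DataForML['clarity']=DataForML['clarity'].map({'I1':1,
                          'SI1':2,
                          'SI2':3,
                          'VS1':4,
                          'VS2':5,
                          'VVS1':6,
                          'VVS2':7,
                          'IF':8
                         })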

Converting the binary nominal variable to numeric using 1/0 mapping

No Binary nominal variables in this data

Converting the nominal variable to numeric using get_dummies()

In [43]:
# Treating all the nominal variables at once using dummy variables
DataForML_Numeric=pd.get_dummies(DataForML)

# Adding Target Variable to the data
DataForML_Numeric['price']=DiamondpricesData['price']

# Printing sample rows
DataForML_Numeric.head()
Out[43]:
   carat     x     y     z  cut  color  clarity  price
0   0.23  3.95  3.98  2.43    4      6        3    326
1   0.21  3.89  3.84  2.31    5      6        2    326
2   0.23  4.05  4.07  2.31    1      6        4    327
3   0.29  4.20  4.23  2.63    5      2        5    334
4   0.31  4.34  4.35  2.75    1      1        3    335
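Since every column in this data is already numeric at this point, get_dummies() leaves it unchanged. To show what steps 2 and 3 would do when such columns exist, here is a small sketch on toy data. The columns 'certified' and 'shape' are hypothetical and are not part of the diamond file.

```python
import pandas as pd

# Toy data with a hypothetical binary column ('certified') and nominal column ('shape')
toy = pd.DataFrame({'carat': [0.23, 0.31, 0.40],
                    'shape': ['round', 'oval', 'round'],
                    'certified': ['Yes', 'No', 'Yes']})

# Binary nominal column: explicit 1/0 mapping
toy['certified'] = toy['certified'].map({'Yes': 1, 'No': 0})

# Remaining nominal columns: one 0/1 column per level
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Numeric columns pass through untouched, and each level of 'shape' becomes its own column.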

Machine Learning: Splitting the data into Training and Testing sample

We don't use the full data for creating the model. Some data is randomly selected and kept aside for checking how good the model is. This is known as Testing data, and the remaining data, on which the model is built, is called Training data. Typically 70% of the data is used as Training data and the remaining 30% is used as Testing data.

In [44]:
# Printing all the column names for our reference
DataForML_Numeric.columns
Out[44]:
Index(['carat', 'x', 'y', 'z', 'cut', 'color', 'clarity', 'price'], dtype='object')
In [45]:
# Separate Target Variable and Predictor Variables
TargetVariable='price'
Predictors=['carat', 'x', 'y', 'z', 'cut', 'color', 'clarity']

X=DataForML_Numeric[Predictors].values
y=DataForML_Numeric[TargetVariable].values

# Split the data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=428)

Standardization/Normalization of data

This step is optional: skip it if you want to compare the accuracy obtained on the raw data with the accuracy obtained after this transformation.

However, if you are using distance based algorithms like KNN, or Neural Networks, then this step becomes necessary.
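A common variation is to let the scaler learn its minimum and maximum from the training rows only, so that nothing about the test rows leaks into the model. A minimal sketch of that idea follows, assuming a scikit-learn Pipeline and synthetic stand-in data (X_demo, y_demo and scaled_knn are illustrative names, not objects from this case study).

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: three numeric predictors, positive target)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0.2, 3.0, size=(500, 3))
y_demo = 4000 * X_demo[:, 0] + 300 * X_demo[:, 1] + rng.normal(0, 100, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The pipeline fits the scaler on whatever data is passed to fit(), here the training rows only
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=4))
scaled_knn.fit(X_tr, y_tr)
test_r2 = scaled_knn.score(X_te, y_te)

# Inside cross validation the scaler is refit on each training fold automatically
cv_r2 = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)

print('Test R2:', round(test_r2, 3))
print('CV R2 per fold:', np.round(cv_r2, 3))
```

With a pipeline, the same object can be passed to cross_val_score() without scaling the full data beforehand.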

In [46]:
### Standardization of data ###
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Choose either standardization or Min-Max normalization
# On this data Min-Max normalization produced better results
#PredictorScaler=StandardScaler()
PredictorScaler=MinMaxScaler()

# Storing the fit object for later reference
# Note: the scaler is fit on the full data here, before the train/test split
PredictorScalerFit=PredictorScaler.fit(X)

# Generating the scaled values of X
X=PredictorScalerFit.transform(X)

# Split the scaled data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [47]:
# Sanity check for the sampled data
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(37636, 7)
(37636,)
(16131, 7)
(16131,)

Multiple Linear Regression

In [48]:
# Multiple Linear Regression
from sklearn.linear_model import LinearRegression
RegModel = LinearRegression()

# Printing all the parameters of Linear regression
print(RegModel)

# Creating the model on Training Data
LREG=RegModel.fit(X_train,y_train)
prediction=LREG.predict(X_test)

# Only the predictors were scaled, so the predictions are already in the original price units

from sklearn import metrics
# Measuring Goodness of fit in Training data
print('R2 Value:',metrics.r2_score(y_train, LREG.predict(X_train)))

###########################################################################
print('\n##### Model Validation and Accuracy Calculations ##########')

# Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=np.round(prediction)

# Printing sample prediction values
print(TestingDataResults[[TargetVariable,'Predicted'+TargetVariable]].head())

# Calculating the error for each row
TestingDataResults['APE']=100 * ((abs(
  TestingDataResults['price']-TestingDataResults['Predictedprice']))/TestingDataResults['price'])

MAPE=np.mean(TestingDataResults['APE'])
MedianMAPE=np.median(TestingDataResults['APE'])

Accuracy =100 - MAPE
MedianAccuracy=100- MedianMAPE
print('Mean Accuracy on test data:', Accuracy) # Can be negative sometimes due to outlier
print('Median Accuracy on test data:', MedianAccuracy)


# Defining a custom function to calculate accuracy
# Make sure there are no zeros in the Target variable if you are using MAPE
def Accuracy_Score(orig,pred):
    MAPE = np.mean(100 * (np.abs(orig-pred)/orig))
    #print('#'*70,'Accuracy:', 100-MAPE)
    return(100-MAPE)

# Custom Scoring MAPE calculation
from sklearn.metrics import make_scorer
custom_Scoring=make_scorer(Accuracy_Score, greater_is_better=True)

# Importing cross validation function from sklearn
from sklearn.model_selection import cross_val_score

# Running 10-Fold Cross validation on a given algorithm
# Passing full data X and y because the K-fold will split the data and automatically choose train/test
Accuracy_Values=cross_val_score(RegModel, X , y, cv=10, scoring=custom_Scoring)
print('\nAccuracy values for 10-fold Cross Validation:\n',Accuracy_Values)
print('\nFinal Average Accuracy of the model:', round(Accuracy_Values.mean(),2))
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
R2 Value: 0.8990644042515243

##### Model Validation and Accuracy Calculations ##########
   price  Predictedprice
0   4039          4265.0
1   3239          3589.0
2   6089          6015.0
3   9660          7284.0
4   2326          2614.0
Mean Accuracy on test data: 59.75615085548515
Median Accuracy on test data: 77.36475798797849

Accuracy values for 10-fold Cross Validation:
 [61.5205416  69.61187421 78.8435249  75.85682864 75.4520242  17.0499129
 29.69858832 41.33422654 56.13340688 65.1282487 ]

Final Average Accuracy of the model: 57.06
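The accuracy used throughout this case study is 100 minus the absolute percentage error (APE), summarized by either the mean or the median. As a worked example, the sketch below applies that arithmetic to the five sample rows printed above. The helper name mape_accuracy is mine and is not part of the notebook.

```python
import numpy as np

def mape_accuracy(actual, predicted):
    """Return (mean accuracy, median accuracy), each computed as 100 - APE summary."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if np.any(actual == 0):
        raise ValueError('APE is undefined when an actual value is zero')
    # Absolute percentage error of every row
    ape = 100 * np.abs(actual - predicted) / np.abs(actual)
    return 100 - ape.mean(), 100 - np.median(ape)

# Five sample rows from the Linear Regression output above
actual = [4039, 3239, 6089, 9660, 2326]
predicted = [4265, 3589, 6015, 7284, 2614]

mean_acc, median_acc = mape_accuracy(actual, predicted)
print('Mean accuracy on the five rows:', round(mean_acc, 2))
print('Median accuracy on the five rows:', round(median_acc, 2))
```

The fourth row (9660 against 7284) has a much larger error than the rest, which pulls the mean down more than the median. That is the reason both figures are reported for every model.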

Decision Trees

In [49]:
# Decision Trees (Multiple if-else statements!)
from sklearn.tree import DecisionTreeRegressor
# Note: criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
RegModel = DecisionTreeRegressor(max_depth=10,criterion='mse')
# Good range of max_depth = 2 to 20

# Printing all the parameters of Decision Tree
print(RegModel)

# Creating the model on Training Data
DT=RegModel.fit(X_train,y_train)
prediction=DT.predict(X_test)

from sklearn import metrics
# Measuring Goodness of fit in Training data
print('R2 Value:',metrics.r2_score(y_train, DT.predict(X_train)))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(DT.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

###########################################################################
print('\n##### Model Validation and Accuracy Calculations ##########')

# Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=np.round(prediction)

# Printing sample prediction values
print(TestingDataResults[[TargetVariable,'Predicted'+TargetVariable]].head())

# Calculating the error for each row
TestingDataResults['APE']=100 * ((abs(
  TestingDataResults['price']-TestingDataResults['Predictedprice']))/TestingDataResults['price'])

MAPE=np.mean(TestingDataResults['APE'])
MedianMAPE=np.median(TestingDataResults['APE'])

Accuracy =100 - MAPE
MedianAccuracy=100- MedianMAPE
print('Mean Accuracy on test data:', Accuracy) # Can be negative sometimes due to outlier
print('Median Accuracy on test data:', MedianAccuracy)


# Defining a custom function to calculate accuracy
# Make sure there are no zeros in the Target variable if you are using MAPE
def Accuracy_Score(orig,pred):
    MAPE = np.mean(100 * (np.abs(orig-pred)/orig))
    #print('#'*70,'Accuracy:', 100-MAPE)
    return(100-MAPE)

# Custom Scoring MAPE calculation
from sklearn.metrics import make_scorer
custom_Scoring=make_scorer(Accuracy_Score, greater_is_better=True)

# Importing cross validation function from sklearn
from sklearn.model_selection import cross_val_score

# Running 10-Fold Cross validation on a given algorithm
# Passing full data X and y because the K-fold will split the data and automatically choose train/test
Accuracy_Values=cross_val_score(RegModel, X , y, cv=10, scoring=custom_Scoring)
print('\nAccuracy values for 10-fold Cross Validation:\n',Accuracy_Values)
print('\nFinal Average Accuracy of the model:', round(Accuracy_Values.mean(),2))
DecisionTreeRegressor(criterion='mse', max_depth=10, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
R2 Value: 0.9813787032743522

##### Model Validation and Accuracy Calculations ##########
   price  Predictedprice
0   4039          3163.0
1   3239          3682.0
2   6089          6620.0
3   9660         10207.0
4   2326          2564.0
Mean Accuracy on test data: 90.67854234192478
Median Accuracy on test data: 92.82798292818705

Accuracy values for 10-fold Cross Validation:
 [81.16854675 84.75515584 84.21967067 81.42858773 77.56498826 81.22885375
 83.36795182 79.16804778 80.62930613 82.33546148]

Final Average Accuracy of the model: 81.59

Plotting a Decision Tree

In [50]:
# Installing the required library for plotting the decision tree
# Make sure to run all three commands
# 1. Open anaconda Prompt
# pip install graphviz
# conda install graphviz
# pip install pydotplus
In [51]:
# Adding graphviz path to the PATH env variable
# Try to find "dot.exe" in your system and provide the path of that folder
import os
os.environ["PATH"] += os.pathsep + 'C:\\Users\\fhashmi\\AppData\\Local\\Continuum\\Anaconda3\\Library\\bin\\graphviz'
In [2]:
# max_depth=10 is too large to plot here

# Load libraries
#from IPython.display import Image
#from sklearn import tree
#import pydotplus

# Create DOT data
#dot_data = tree.export_graphviz(RegModel, out_file=None, 
#                                feature_names=Predictors, class_names=TargetVariable)

# printing the rules
#print(dot_data)

# Draw graph
#graph = pydotplus.graph_from_dot_data(dot_data)

# Show graph
#Image(graph.create_png(), width=5000,height=5000)
# Double click on the graph to zoom in
# This graph with max_depth=10 is too large to plot!!
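If installing graphviz is not convenient, scikit-learn can also print the rules of a tree as plain text using export_text(). Below is a minimal sketch on synthetic data with a shallow tree. The data and the names X_demo, y_demo and small_tree are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data: the target depends only on the first column
rng = np.random.default_rng(0)
feature_names = ['carat', 'color', 'clarity']
X_demo = np.column_stack([
    rng.uniform(0.2, 3.0, 300),   # carat-like values
    rng.integers(1, 8, 300),      # color codes 1 to 7
    rng.integers(1, 9, 300),      # clarity codes 1 to 8
])
y_demo = 5000 * X_demo[:, 0]

# A shallow tree keeps the printed rules short enough to read
small_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
small_tree.fit(X_demo, y_demo)

rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

The printed text is the same set of if-else rules that the graph would show, one indented line per split.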

Random Forest

In [53]:
# Random Forest (Bagging of multiple Decision Trees)
from sklearn.ensemble import RandomForestRegressor
# Note: criterion='mse' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
RegModel = RandomForestRegressor(max_depth=5, n_estimators=100,criterion='mse')
# Good range for max_depth: 2-10 and n_estimators: 100-1000

# Printing all the parameters of Random Forest
print(RegModel)

# Creating the model on Training Data
RF=RegModel.fit(X_train,y_train)
prediction=RF.predict(X_test)

from sklearn import metrics
# Measuring Goodness of fit in Training data
print('R2 Value:',metrics.r2_score(y_train, RF.predict(X_train)))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(RF.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

###########################################################################
print('\n##### Model Validation and Accuracy Calculations ##########')

# Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=np.round(prediction)

# Printing sample prediction values
print(TestingDataResults[[TargetVariable,'Predicted'+TargetVariable]].head())

# Calculating the error for each row
TestingDataResults['APE']=100 * ((abs(
  TestingDataResults['price']-TestingDataResults['Predictedprice']))/TestingDataResults['price'])

MAPE=np.mean(TestingDataResults['APE'])
MedianMAPE=np.median(TestingDataResults['APE'])

Accuracy =100 - MAPE
MedianAccuracy=100- MedianMAPE
print('Mean Accuracy on test data:', Accuracy) # Can be negative sometimes due to outlier
print('Median Accuracy on test data:', MedianAccuracy)


# Defining a custom function to calculate accuracy
# Make sure there are no zeros in the Target variable if you are using MAPE
def Accuracy_Score(orig,pred):
    MAPE = np.mean(100 * (np.abs(orig-pred)/orig))
    #print('#'*70,'Accuracy:', 100-MAPE)
    return(100-MAPE)

# Custom Scoring MAPE calculation
from sklearn.metrics import make_scorer
custom_Scoring=make_scorer(Accuracy_Score, greater_is_better=True)

# Importing cross validation function from sklearn
from sklearn.model_selection import cross_val_score

# Running 10-Fold Cross validation on a given algorithm
# Passing full data X and y because the K-fold will split the data and automatically choose train/test
Accuracy_Values=cross_val_score(RegModel, X , y, cv=10, scoring=custom_Scoring)
print('\nAccuracy values for 10-fold Cross Validation:\n',Accuracy_Values)
print('\nFinal Average Accuracy of the model:', round(Accuracy_Values.mean(),2))
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)
R2 Value: 0.9470092716125311

##### Model Validation and Accuracy Calculations ##########
   price  Predictedprice
0   4039          3192.0
1   3239          3192.0
2   6089          7298.0
3   9660         10891.0
4   2326          2337.0
Mean Accuracy on test data: 84.19329966671175
Median Accuracy on test data: 87.74360178445644

Accuracy values for 10-fold Cross Validation:
 [73.76144119 82.00197953 80.17908096 73.41178631 77.93593939 79.42206359
 78.89455477 71.23453003 79.56507287 78.41238645]

Final Average Accuracy of the model: 77.48

Plotting one of the Decision Trees in Random Forest

In [4]:
#max_depth=5 is too large to plot here

# Plotting a single Decision Tree from Random Forest
# Load libraries
#from IPython.display import Image
#from sklearn import tree
#import pydotplus

# Create DOT data for the 6th Decision Tree in Random Forest
#dot_data = tree.export_graphviz(RegModel.estimators_[5] , out_file=None, feature_names=Predictors, class_names=TargetVariable)

# Draw graph
#graph = pydotplus.graph_from_dot_data(dot_data)

# Show graph
#Image(graph.create_png(), width=500,height=500)
# Double click on the graph to zoom in

AdaBoost

In [55]:
# Adaboost (Boosting of multiple Decision Trees)
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Choosing a Decision Tree with max_depth=10 as the base learner
# (a 1-level tree, max_depth=1, would be the classic weak learner)
DTR=DecisionTreeRegressor(max_depth=10)
# Note: base_estimator was renamed to estimator in scikit-learn 1.2
RegModel = AdaBoostRegressor(n_estimators=100, base_estimator=DTR ,learning_rate=0.01)

# Printing all the parameters of Adaboost
print(RegModel)

# Creating the model on Training Data
AB=RegModel.fit(X_train,y_train)
prediction=AB.predict(X_test)

from sklearn import metrics
# Measuring Goodness of fit in Training data
print('R2 Value:',metrics.r2_score(y_train, AB.predict(X_train)))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(AB.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')

###########################################################################
print('\n##### Model Validation and Accuracy Calculations ##########')

# Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=np.round(prediction)

# Printing sample prediction values
print(TestingDataResults[[TargetVariable,'Predicted'+TargetVariable]].head())

# Calculating the error for each row
TestingDataResults['APE']=100 * ((abs(
  TestingDataResults['price']-TestingDataResults['Predictedprice']))/TestingDataResults['price'])

MAPE=np.mean(TestingDataResults['APE'])
MedianMAPE=np.median(TestingDataResults['APE'])

Accuracy =100 - MAPE
MedianAccuracy=100- MedianMAPE
print('Mean Accuracy on test data:', Accuracy) # Can be negative sometimes due to outlier
print('Median Accuracy on test data:', MedianAccuracy)


# Defining a custom function to calculate accuracy
# Make sure there are no zeros in the Target variable if you are using MAPE
def Accuracy_Score(orig,pred):
    MAPE = np.mean(100 * (np.abs(orig-pred)/orig))
    #print('#'*70,'Accuracy:', 100-MAPE)
    return(100-MAPE)

# Custom Scoring MAPE calculation
from sklearn.metrics import make_scorer
custom_Scoring=make_scorer(Accuracy_Score, greater_is_better=True)

# Importing cross validation function from sklearn
from sklearn.model_selection import cross_val_score

# Running 10-Fold Cross validation on a given algorithm
# Passing full data X and y because the K-fold will split the data and automatically choose train/test
Accuracy_Values=cross_val_score(RegModel, X , y, cv=10, scoring=custom_Scoring)
print('\nAccuracy values for 10-fold Cross Validation:\n',Accuracy_Values)
print('\nFinal Average Accuracy of the model:', round(Accuracy_Values.mean(),2))
AdaBoostRegressor(base_estimator=DecisionTreeRegressor(criterion='mse',
                                                       max_depth=10,
                                                       max_features=None,
                                                       max_leaf_nodes=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                                                       min_weight_fraction_leaf=0.0,
                                                       presort=False,
                                                       random_state=None,
                                                       splitter='best'),
                  learning_rate=0.01, loss='linear', n_estimators=100,
                  random_state=None)
R2 Value: 0.9850378415424257

##### Model Validation and Accuracy Calculations ##########
   price  Predictedprice
0   4039          3232.0
1   3239          3237.0
2   6089          6440.0
3   9660         10063.0
4   2326          2441.0
Mean Accuracy on test data: 91.31191550048814
Median Accuracy on test data: 93.35832886984467

Accuracy values for 10-fold Cross Validation:
 [81.75253683 85.58998051 84.74645796 81.94931151 78.37558189 81.32958033
 84.41150174 81.28810989 81.42746132 82.09531643]

Final Average Accuracy of the model: 82.3

Plotting one of the Decision trees from Adaboost

In [5]:
# max_depth=10 is too large to plot here

# Plotting a single Decision Tree from Adaboost
# Load libraries
#from IPython.display import Image
#from sklearn import tree
#import pydotplus

# Create DOT data for the 6th Decision Tree in Adaboost
#dot_data = tree.export_graphviz(RegModel.estimators_[5] , out_file=None, feature_names=Predictors, class_names=TargetVariable)

# Draw graph
#graph = pydotplus.graph_from_dot_data(dot_data)

# Show graph
#Image(graph.create_png(), width=500,height=500)
# Use a smaller value of max_depth if you wish to plot it here!

XGBoost

In [57]:
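# Xtreme Gradient Boosting (XGBoost)
from xgboost import XGBRegressor
# Note: objective 'reg:linear' is deprecated in newer XGBoost releases in favor of 'reg:squarederror'
RegModel=XGBRegressor(max_depth=2, 
                      learning_rate=0.1, 
                      n_estimators=1000, 
                      objective='reg:linear', 
                      booster='gbtree')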

# Printing all the parameters of XGBoost
print(RegModel)

# Creating the model on Training Data
XGB=RegModel.fit(X_train,y_train)
prediction=XGB.predict(X_test)

from sklearn import metrics
# Measuring Goodness of fit in Training data
print('R2 Value:',metrics.r2_score(y_train, XGB.predict(X_train)))

# Plotting the feature importance for Top 10 most important columns
%matplotlib inline
feature_importances = pd.Series(XGB.feature_importances_, index=Predictors)
feature_importances.nlargest(10).plot(kind='barh')
###########################################################################
print('\n##### Model Validation and Accuracy Calculations ##########')

# Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=np.round(prediction)

# Printing sample prediction values
print(TestingDataResults[[TargetVariable,'Predicted'+TargetVariable]].head())

# Calculating the error for each row
TestingDataResults['APE']=100 * ((abs(
  TestingDataResults['price']-TestingDataResults['Predictedprice']))/TestingDataResults['price'])


MAPE=np.mean(TestingDataResults['APE'])
MedianMAPE=np.median(TestingDataResults['APE'])

Accuracy =100 - MAPE
MedianAccuracy=100- MedianMAPE
print('Mean Accuracy on test data:', Accuracy) # Can be negative sometimes due to outlier
print('Median Accuracy on test data:', MedianAccuracy)


# Defining a custom function to calculate accuracy
# Make sure there are no zeros in the Target variable if you are using MAPE
def Accuracy_Score(orig,pred):
    MAPE = np.mean(100 * (np.abs(orig-pred)/orig))
    #print('#'*70,'Accuracy:', 100-MAPE)
    return(100-MAPE)

# Custom Scoring MAPE calculation
from sklearn.metrics import make_scorer
custom_Scoring=make_scorer(Accuracy_Score, greater_is_better=True)

# Importing cross validation function from sklearn
from sklearn.model_selection import cross_val_score

# Running 10-Fold Cross validation on a given algorithm
# Passing full data X and y because the K-fold will split the data and automatically choose train/test
Accuracy_Values=cross_val_score(RegModel, X , y, cv=10, scoring=custom_Scoring)
print('\nAccuracy values for 10-fold Cross Validation:\n',Accuracy_Values)
print('\nFinal Average Accuracy of the model:', round(Accuracy_Values.mean(),2))
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
             max_depth=2, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=True, subsample=1)
R2 Value: 0.9779276198934674

##### Model Validation and Accuracy Calculations ##########
   price  Predictedprice
0   4039          3475.0
1   3239          3291.0
2   6089          6366.0
3   9660          8690.0
4   2326          2641.0
Mean Accuracy on test data: 85.92081928253174
Median Accuracy on test data: 91.28440380096436

Accuracy values for 10-fold Cross Validation:
 [86.62718202 87.51281994 88.81938537 86.4094723  86.30346639 78.02564869
 75.84798022 77.50766317 85.11681295 87.18975147]

Final Average Accuracy of the model: 83.94

Plotting a single Decision tree out of XGBoost

In [58]:
from xgboost import plot_tree
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(20, 8))
plot_tree(XGB, num_trees=10, ax=ax)
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x121f9b910>

KNN

In [59]:
# K-Nearest Neighbor(KNN)
from sklearn.neighbors import KNeighborsRegressor
RegModel = KNeighborsRegressor(n_neighbors=4)

# Printing all the parameters of KNN
print(RegModel)

# Creating the model on Training Data
KNN=RegModel.fit(X_train,y_train)
prediction=KNN.predict(X_test)

from sklearn import metrics
# Measuring Goodness of fit in Training data
print('R2 Value:',metrics.r2_score(y_train, KNN.predict(X_train)))

# Plotting the feature importance for Top 10 most important columns
# The variable importance chart is not available for KNN

###########################################################################
print('\n##### Model Validation and Accuracy Calculations ##########')

# Printing some sample values of prediction
TestingDataResults=pd.DataFrame(data=X_test, columns=Predictors)
TestingDataResults[TargetVariable]=y_test
TestingDataResults[('Predicted'+TargetVariable)]=np.round(prediction)

# Printing sample prediction values
print(TestingDataResults[[TargetVariable,'Predicted'+TargetVariable]].head())

# Calculating the error for each row
TestingDataResults['APE']=100 * ((abs(
  TestingDataResults['price']-TestingDataResults['Predictedprice']))/TestingDataResults['price'])

MAPE=np.mean(TestingDataResults['APE'])
MedianMAPE=np.median(TestingDataResults['APE'])

Accuracy =100 - MAPE
MedianAccuracy=100- MedianMAPE
print('Mean Accuracy on test data:', Accuracy) # Can be negative sometimes due to outlier
print('Median Accuracy on test data:', MedianAccuracy)

# Defining a custom function to calculate accuracy
# Make sure there are no zeros in the Target variable if you are using MAPE
def Accuracy_Score(orig,pred):
    MAPE = np.mean(100 * (np.abs(orig-pred)/orig))
    #print('#'*70,'Accuracy:', 100-MAPE)
    return(100-MAPE)

# Custom Scoring MAPE calculation
from sklearn.metrics import make_scorer
custom_Scoring=make_scorer(Accuracy_Score, greater_is_better=True)

# Importing cross validation function from sklearn
from sklearn.model_selection import cross_val_score

# Running 10-Fold Cross validation on a given algorithm
# Passing full data X and y because the K-fold will split the data and automatically choose train/test
Accuracy_Values=cross_val_score(RegModel, X , y, cv=10, scoring=custom_Scoring)
print('\nAccuracy values for 10-fold Cross Validation:\n',Accuracy_Values)
print('\nFinal Average Accuracy of the model:', round(Accuracy_Values.mean(),2))
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=4, p=2,
                    weights='uniform')
R2 Value: 0.9851655402905799

##### Model Validation and Accuracy Calculations ##########
   price  Predictedprice
0   4039          3422.0
1   3239          3310.0
2   6089          5906.0
3   9660          9823.0
4   2326          2538.0
Mean Accuracy on test data: 91.97592905243795
Median Accuracy on test data: 94.44005061689339

Accuracy values for 10-fold Cross Validation:
 [84.09286401 86.94095896 86.20508824 83.91403775 81.31303854 82.43343805
 83.90022242 83.11032154 83.66532689 84.49467018]

Final Average Accuracy of the model: 84.01

Deployment of the Model

Based on the above trials, select the algorithm which produces the best average accuracy. In this case, several algorithms (Decision Trees, AdaBoost, XGBoost and KNN) have produced similar cross validated average accuracy, roughly 82 to 84. Hence, any one of them is a reasonable choice.

I am choosing KNN as the final model since it has the highest average accuracy (84.01) and it is very fast for this data!
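The comparison above was done one cell at a time. The same selection can be sketched as a loop over candidate models, which keeps the scoring identical for all of them. This is a minimal sketch on synthetic data, and the names candidates, results and best_name are illustrative. It also shuffles the folds with KFold, which is my assumption: cross_val_score() with an integer cv uses unshuffled folds for regressors, and the sample rows shown earlier suggest the file may be ordered by price, so it is worth checking whether shuffling changes the fold-to-fold spread on the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def accuracy_score_mape(orig, pred):
    # Same definition as Accuracy_Score in the cells above: 100 - MAPE
    return 100 - np.mean(100 * np.abs(orig - pred) / orig)

scorer = make_scorer(accuracy_score_mape, greater_is_better=True)

# Synthetic stand-in data with a strictly positive target
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(400, 3))
y_demo = 500 + 4000 * X_demo[:, 0] ** 2 + 200 * X_demo[:, 1] + rng.normal(0, 20, 400)

candidates = {
    'LinearRegression': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(max_depth=5, random_state=0),
    'KNN': KNeighborsRegressor(n_neighbors=4),
}

# Shuffled folds, so that row order in the file does not decide the fold contents
cv = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring=scorer)
    results[name] = round(scores.mean(), 2)

best_name = max(results, key=results.get)
print(results)
print('Best average accuracy:', best_name)
```

Speed, stability across folds and ease of deployment can still override a small difference in average accuracy, as in the choice made above.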

In order to deploy the model we follow the steps below:

  1. Train the model using 100% of the data available
  2. Save the model as a serialized file which can be stored anywhere
  3. Create a python function which gets integrated with the front-end (Tableau/Java Website etc.) to take all the inputs and return the prediction

Choosing only the most important variables

It is beneficial to keep fewer predictors in the model while deploying it in production. The fewer predictors you keep, the better, because the model depends on fewer inputs and is hence more stable.

This is especially important when the data is high dimensional (too many predictor columns).

In this data, the most important predictor variables are 'carat', 'y', 'color' and 'clarity', as these are consistently at the top of the variable importance chart for every algorithm. Hence, these are chosen as the final set of predictor variables.
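Reading the importance charts by eye works for a handful of models. The same comparison can be sketched in code by putting the feature_importances_ of several fitted models side by side and averaging them. The data and names below (X_demo, models, ranking) are assumptions for illustration and do not reproduce the charts of this case study.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 'carat' drives the target, 'x' helps a little, 'noise' not at all
rng = np.random.default_rng(3)
names = ['carat', 'x', 'noise']
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = 5000 * X_demo[:, 0] + 500 * X_demo[:, 1] + rng.normal(0, 20, 300)

models = {
    'DecisionTree': DecisionTreeRegressor(max_depth=5, random_state=0),
    'RandomForest': RandomForestRegressor(max_depth=5, n_estimators=50, random_state=0),
}

# One column of importances per model, one row per predictor
importances = pd.DataFrame(
    {name: model.fit(X_demo, y_demo).feature_importances_ for name, model in models.items()},
    index=names)

# Average across models and sort, most important predictor first
importances['mean'] = importances.mean(axis=1)
ranking = importances.sort_values('mean', ascending=False)
print(ranking)
```

Predictors that stay near the top for every model are the natural candidates to keep.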

In [62]:
# Separate Target Variable and Predictor Variables
TargetVariable='price'

# Selecting the final set of predictors for the deployment
# Based on the variable importance charts of multiple algorithms above
Predictors=['carat','y', 'color' , 'clarity']

X=DataForML_Numeric[Predictors].values
y=DataForML_Numeric[TargetVariable].values

### Standardization of data ###
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Choose either standardization or Min-Max normalization
# On this data Min-Max normalization produced better results
#PredictorScaler=StandardScaler()
PredictorScaler=MinMaxScaler()

# Storing the fit object for later reference
PredictorScalerFit=PredictorScaler.fit(X)

# Generating the scaled values of X
X=PredictorScalerFit.transform(X)

print(X.shape)
print(y.shape)
(53767, 4)
(53767,)

Step 1. Retraining the model using 100% data

In [63]:
# K-Nearest Neighbor(KNN)
from sklearn.neighbors import KNeighborsRegressor
RegModel = KNeighborsRegressor(n_neighbors=4)

# Training the model on 100% Data available
Final_KNN_Model=RegModel.fit(X,y)

Cross validating the final model accuracy with fewer predictors

In [64]:
# Importing cross validation function from sklearn
from sklearn.model_selection import cross_val_score

# Running 10-Fold Cross validation on a given algorithm
# Passing full data X and y because the K-fold will split the data and automatically choose train/test
Accuracy_Values=cross_val_score(Final_KNN_Model, X , y, cv=10, scoring=custom_Scoring)
print('\nAccuracy values for 10-fold Cross Validation:\n',Accuracy_Values)
print('\nFinal Average Accuracy of the model:', round(Accuracy_Values.mean(),2))
Accuracy values for 10-fold Cross Validation:
 [83.93745166 86.68504646 86.13021249 83.69627634 81.28995539 79.79591784
 82.66804902 82.99292523 83.88674132 84.14375815]

Final Average Accuracy of the model: 83.52

Step 2. Save the model as a serialized file which can be stored anywhere

In [65]:
import pickle
import os

# Saving the Python objects as serialized files can be done using pickle library
# Here let us save the Final model
with open('Final_KNN_Model.pkl', 'wb') as fileWriteStream:
    pickle.dump(Final_KNN_Model, fileWriteStream)
    # The with block closes the file stream automatically
    
print('pickle file of Predictive Model is saved at Location:',os.getcwd())
pickle file of Predictive Model is saved at Location: /Users/farukh/Python Case Studies
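Only the model is saved above, while the prediction function in Step 3 also needs the fitted scaler (PredictorScalerFit) to be present in memory. One possible alternative, sketched here as an assumption and not as the approach used in this case study, is to bundle the scaler and the model into a single scikit-learn Pipeline and pickle that one object. The data, the file name and the variable names are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four final predictors and the price
rng = np.random.default_rng(7)
X_demo = rng.uniform(0.2, 3.0, size=(200, 4))
y_demo = 4000 * X_demo[:, 0] + rng.normal(0, 50, 200)

# Scaler and model travel together as one object
bundle = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=4))
bundle.fit(X_demo, y_demo)

# Writing to a temporary folder so the sketch does not touch the working directory
model_path = os.path.join(tempfile.mkdtemp(), 'knn_pipeline_demo.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(bundle, f)

# A fresh process would only need this file to scale new inputs and predict
with open(model_path, 'rb') as f:
    restored = pickle.load(f)

same = np.allclose(bundle.predict(X_demo[:5]), restored.predict(X_demo[:5]))
print('Predictions match after reload:', same)
```

The restored object accepts raw, unscaled inputs because the scaling step is stored inside it.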

Step 3. Create a python function

In [69]:
# This function can be called from any front end tool/website
def FunctionPredictResult(InputData):
    import pandas as pd
    Num_Inputs=InputData.shape[0]
    
    # Making sure the input data has same columns as it was used for training the model
    # Also, if standardization/normalization was done, then same must be done for new input
    
    # Appending the Training data below the new data
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    DataForML=pd.read_pickle('DataForML.pkl')
    InputData=pd.concat([InputData, DataForML])
    
    # Treating ordinal variables with the same mappings used during training
    # Replacing the ordinal values of color
    InputData['color']=InputData['color'].replace({'J':1,
                                                   'I':2,
                                                   'H':3,
                                                   'G':4,
                                                   'F':5,
                                                   'E':6,
                                                   'D':7
                                                  })
    
    # Replacing the ordinal values for clarity
    InputData['clarity']=InputData['clarity'].replace({'I1':1,
                                                       'SI1':2,
                                                       'SI2':3,
                                                       'VS1':4,
                                                       'VS2':5,
                                                       'VVS1':6,
                                                       'VVS2':7,
                                                       'IF':8
                                                      })
    
    # Generating dummy variables for rest of the nominal variables
    InputData=pd.get_dummies(InputData)
            
    # Maintaining the same order of columns as it was during the model training
    Predictors=['carat','y', 'color' , 'clarity']
    
    # Generating the input values to the model
    X=InputData[Predictors].values[0:Num_Inputs]
    
    # Generating the standardized values of X since it was done while model training also
    X=PredictorScalerFit.transform(X)
    
    # Loading the model from the pickle file
    import pickle
    with open('Final_KNN_Model.pkl', 'rb') as fileReadStream:
        # The with block closes the filestream automatically
        PredictionModel=pickle.load(fileReadStream)
            
    # Generating predictions
    Prediction=PredictionModel.predict(X)
    PredictionResult=pd.DataFrame(Prediction, columns=['Prediction'])
    return(PredictionResult)
In [70]:
# Calling the function for new sample data
NewSampleData=pd.DataFrame(
data=[[0.23,3.98,'E','SI2'],
     [0.29, 4.23,'I','VS2']],
columns=['carat','y', 'color' , 'clarity'])

print(NewSampleData)

# Calling the Function for prediction
FunctionPredictResult(InputData= NewSampleData)
   carat     y color clarity
0   0.23  3.98     E     SI2
1   0.29  4.23     I     VS2
Out[70]:
   Prediction
0      401.25
1      412.50

The function FunctionPredictResult() can produce predictions for one or more cases at a time. Hence, it can be scheduled as a batch job or cron job that runs every night and generates predictions for all the diamonds available in the system.
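
When many rows are scored unattended, it can help to check the categorical inputs first. The replace() step inside FunctionPredictResult() leaves any label it does not recognise unchanged, so a mistyped color or clarity would reach the scaler as a string and fail with an unhelpful error. Below is a minimal sketch of such a check, using only the standard library. The helper find_invalid_rows() and the two label sets are hypothetical names introduced here for illustration; they are not part of the original case study, and the labels are simply copied from the mappings in the function above.

```python
# Allowed labels, copied from the mappings inside FunctionPredictResult
VALID_COLORS = {'J', 'I', 'H', 'G', 'F', 'E', 'D'}
VALID_CLARITIES = {'I1', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2', 'IF'}

def find_invalid_rows(rows):
    """Return (row_index, problem) pairs for rows that should not be scored.

    rows is a list of dicts with the keys carat, y, color and clarity.
    Hypothetical helper, not part of the original case study.
    """
    problems = []
    for i, row in enumerate(rows):
        # Labels are case sensitive because the replace() mappings are
        if row.get('color') not in VALID_COLORS:
            problems.append((i, 'unknown color: %r' % row.get('color')))
        if row.get('clarity') not in VALID_CLARITIES:
            problems.append((i, 'unknown clarity: %r' % row.get('clarity')))
        # carat and y must be convertible to numbers
        for name in ('carat', 'y'):
            try:
                float(row.get(name))
            except (TypeError, ValueError):
                problems.append((i, '%s is not a number' % name))
    return problems

sample_rows = [
    {'carat': 0.23, 'y': 3.98, 'color': 'E', 'clarity': 'SI2'},
    {'carat': 0.29, 'y': 4.23, 'color': 'i', 'clarity': 'VS2'},  # lower-case label
]
print(find_invalid_rows(sample_rows))
```

For a pandas data frame such as NewSampleData, the list of dicts can be obtained with NewSampleData.to_dict('records'). Rows reported by the check can then be dropped or logged before the remaining rows are passed to FunctionPredictResult().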


Deploying a predictive model as an API

  • Django and Flask are two popular Python web frameworks for deploying predictive models as a web service
  • You can call your predictive models using a URL from any front end like Tableau, Java or AngularJS

Creating the model with few parameters

Function for predictions API

In [71]:
# Creating the function which can take inputs and return prediction
def FunctionGeneratePrediction(inp_carat, inp_y, inp_color, inp_clarity):
    
    # Creating a data frame for the model input
    SampleInputData=pd.DataFrame(
     data=[[inp_carat , inp_y, inp_color, inp_clarity]],
     columns=['carat','y', 'color', 'clarity'])

    # Calling the function defined above using the input parameters
    Predictions=FunctionPredictResult(InputData= SampleInputData)

    # Returning the predicted value
    return(Predictions.to_json())

# Function call
FunctionGeneratePrediction(  inp_carat=0.29,
                             inp_y =4.23,
                             inp_color='I',
                             inp_clarity='VS2'
                             )
Out[71]:
'{"Prediction":{"0":412.5}}'
In [72]:
# Installing the flask library required to create the API
#!pip install flask

Creating Flask API

In [73]:
from flask import Flask, request, jsonify
import pickle
import pandas as pd
import numpy
In [74]:
app = Flask(__name__)

@app.route('/prediction_api', methods=["GET"])
def prediction_api():
    try:
        # Getting the parameters from API call
        carat_value = float(request.args.get('carat'))
        y_value=float(request.args.get('y'))
        color_value=request.args.get('color')
        clarity_value=request.args.get('clarity')
                
        # Calling the function to get predictions
        prediction_from_api=FunctionGeneratePrediction(
                                                         inp_carat=carat_value,
                                                         inp_y =y_value,
                                                         inp_color=color_value,
                                                         inp_clarity=clarity_value
                                                      )

        return (prediction_from_api)
    
    except Exception as e:
        return('Something is not right!:'+str(e))
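
If a parameter is left out of the URL, request.args.get() returns None and float(None) raises a TypeError, so the caller only sees a generic message from the except branch. One possible refinement is to validate the parameters before calling the prediction function. The sketch below assumes a hypothetical helper named parse_prediction_params(), which is not part of the original case study. It accepts any mapping with a get() method, so it can be tried with a plain dictionary as well as with request.args.

```python
def parse_prediction_params(args):
    """Validate query parameters and return them as keyword arguments.

    args is any mapping with a .get() method, for example flask's request.args.
    Raises ValueError with a readable message when a parameter is missing
    or when carat / y cannot be converted to a number.
    Hypothetical helper, not part of the original case study.
    """
    required = ('carat', 'y', 'color', 'clarity')
    missing = [name for name in required if args.get(name) in (None, '')]
    if missing:
        raise ValueError('missing parameter(s): ' + ', '.join(missing))

    try:
        carat_value = float(args.get('carat'))
        y_value = float(args.get('y'))
    except ValueError:
        raise ValueError('carat and y must be numbers')

    # Keys match the argument names of FunctionGeneratePrediction
    return {'inp_carat': carat_value,
            'inp_y': y_value,
            'inp_color': args.get('color'),
            'inp_clarity': args.get('clarity')}

# Trying the helper with a plain dictionary in place of request.args
print(parse_prediction_params(
    {'carat': '0.29', 'y': '4.23', 'color': 'I', 'clarity': 'VS2'}))
```

Inside the route, the result could be passed on as FunctionGeneratePrediction(**parse_prediction_params(request.args)), and the ValueError message returned to the caller from the existing except branch.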

Starting the API engine

In [75]:
import os
if __name__ =="__main__":
    
    # Hosting the API in localhost
    app.run(host='127.0.0.1', port=8080, threaded=True, debug=True, use_reloader=False)
    # Interrupt kernel to stop the API
 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on http://127.0.0.1:8080/ (Press CTRL+C to quit)
127.0.0.1 - - [20/Sep/2020 00:21:35] "GET /prediction_api?carat=0.29&y=4.23&color=I&clarity=VS2 HTTP/1.1" 200 -

Sample URL to call the API

This URL can be called by any front end application like Java, Tableau etc. Once the parameters are passed to it, the predictions will be generated.
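
As an illustration, the URL can also be assembled and called from Python itself. The sketch below uses only the standard library and assumes the API is running on the host, port and route shown above (127.0.0.1, 8080, /prediction_api). The function name call_prediction_api() is a hypothetical name introduced here, and it is only defined, not executed, because it needs the Flask server to be up.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Host, port and route taken from the Flask code above
base_url = 'http://127.0.0.1:8080/prediction_api'

# The same sample diamond used in the earlier function call
params = {'carat': 0.29, 'y': 4.23, 'color': 'I', 'clarity': 'VS2'}

# urlencode builds the query string and escapes special characters if any
sample_url = base_url + '?' + urlencode(params)
print(sample_url)

def call_prediction_api(url):
    """Send a GET request to the API and return the response body as text.

    Hypothetical helper; it only works while the Flask API is running.
    """
    with urlopen(url) as response:
        return response.read().decode('utf-8')
```

With the API running, call_prediction_api(sample_url) should return the same JSON string that FunctionGeneratePrediction() produced earlier.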